We analyze a science collaboration network, i.e. a network whose nodes are scientists,
with an edge connecting two of them for each paper they published together. Furthermore,
we develop a model for the simulation of discontiguous small-world networks that
agrees well with the empirical data.

1 Introduction

On hearing the term network, the first association that comes to mind is a physically
wired network such as a telephone or computer net.

However, the term also denotes the same concept on a universal level: nodes of
whatever type, connected by links that represent relations of the most different kinds.
In mathematical terms, networks are often called graphs, nodes vertices and links edges.

Thus, there are numerous different kinds of networks: physical ones (e.g. hard-wired),
logical ones (e.g. dependencies) and social ones (e.g. contacts, friendships), reaching
out to topics far from wired networks [?]. The area is under vigorous research. Good
reviews can be found in [?].

As a crucial difference from true random graphs [?], many human-created networks
show the small-world effect [?], i.e. the average path length between two random
points is significantly shorter (it grows logarithmically with the system size) than that
of random networks (linear in the system size).

A model developed by Barabasi and Albert [12] describes several small-world networks
very well, including e.g. the World Wide Web [13, 14].

In the context of science, the network with scientists as nodes of the graph is of
particular interest. This network belongs to the group of social ones, with humans as
nodes. Unlike most other forms of social relationships, which are quite difficult to capture
objectively, published papers are covered very widely by the Science Citation Index
and are therefore easily available for research.



2 Collaboration networks 

2.1 Typology 

Apart from linking the papers instead of the scientists, there are basically two possible
choices of what to consider a link between two authors, both covered equally by the
database.



1. We could consider citations from one author to another as links [?]. In this
case we get a rapidly growing number of papers to consider, as each added
scientist cites several others and so on. It is not even clear a priori that we ever
reach an end, except in the case of having all scientists in our set of authors.

2. The only connections in our net are those of co-authorship of one or several papers
[?]. Now we have the advantage of being able to choose an arbitrary set
of authors as the start of our examination, though we should choose a
reasonable one. This choice will be discussed later.

But another problem arises with this method: it is very difficult, if not
impossible, to determine all papers a given pair of authors ever published.



2.2 Building the net 

As a solution we chose the following procedure: We start with one paper. As one part
of our work will deal with Barabasi-Albert networks, we take the corresponding paper
[12] as the center of our investigation.

To determine the set of authors we want to deal with, we select all 185 papers that cite
this paper. (Remark: we have to be careful not to mix citation data from different dates,
as new papers are continuously added to the database. Our investigation is based on
the data of October 21st, 2002.)

As a second step we construct a list of unique authors from all these papers. In a first
approach we obtain 559 scientists, some of whom turn out to be identical and only appear
under misspelled names in particular papers. We finish with a set of 555 authors, to whom
we assign consecutive numbers.

The last step of the network creation consists in establishing links between all these 
authors. This is done by selecting one paper after the other and introducing a connection 
between each possible pair of this paper's authors. 
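
The linking step just described can be sketched in a few lines of Python. This is only an illustration: the function name `coauthor_edges` and the toy `papers` list are hypothetical and not taken from the study; authors are assumed to be identified by their consecutive numbers.

```python
from itertools import combinations

def coauthor_edges(papers):
    """Return the set of undirected links implied by a list of author lists."""
    edges = set()
    for authors in papers:
        # Every possible pair of this paper's authors gets a connection;
        # sorting makes (a, b) and (b, a) the same link.
        for a, b in combinations(sorted(set(authors)), 2):
            edges.add((a, b))
    return edges

# Hypothetical toy data: each inner list holds the author numbers of one paper.
papers = [[0, 1], [1, 2, 3], [4], [0, 1]]
edges = coauthor_edges(papers)
```

Storing the links in a set means that a pair of authors with several joint papers is still linked only once, and a single-author paper contributes no link at all.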

In the end, this gives us a net which we suppose to be rather typical of scientific
collaboration.

The network size is relatively small compared to all data in the Science Citation Index
(approx. 10^7 papers). We will study properties of this subnet and compare it to some
classical and recent network models, hoping to get an idea of what leads to the structure
we observe. Verification with bigger networks is a task for the future.



2.3 Statistics 

Now we analyze a crucial property of the net just created: the cluster size
distribution. In our case, clusters of scientists are formed by the links between them,
i.e. a single paper with n authors already forms a cluster of size n.

Immediately, our eyes are caught by a paper on the Human Genome Project [20] with
274 authors. The giant cluster thus formed stands out from all others by its sheer size.
Regarding it as an anomaly, we chose to remove it from our network. A brief examination
shows that this does no harm, as the scientists participating in this work did not cooperate
with the others in our study and form a big cluster containing only themselves. We will
discuss later, in section 3.3.1, whether this procedure was justified.

Now we investigate the frequency of clusters of a given size. We expect to see many
clusters with few authors and few clusters with many authors.

Figure 1: Frequency distribution of the cluster size

The experimental data (figure 1) show this behavior, but with one surprise: although
the most frequent cluster size is 2, due to a large number of publications with two authors,
most scientists collaborate with three others.

A probable explanation is this: scientists are often members of research groups
involved in different topics, thus connecting different clusters formed by single
two-author papers.
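
The two statistics compared here, the number of clusters of a given size and the number of collaborators per scientist, can be obtained from a list of links as in the following sketch. The function names and the toy network are hypothetical. A simple union-find structure is assumed for finding the clusters, and each pair of authors is assumed to appear at most once in `edges`.

```python
from collections import Counter

def cluster_size_counts(n_nodes, edges):
    """Map cluster size -> number of clusters of that size (union-find)."""
    parent = list(range(n_nodes))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)
    members = Counter(find(v) for v in range(n_nodes))  # root -> cluster size
    return Counter(members.values())

def degree_counts(n_nodes, edges):
    """Map number of collaborators -> number of scientists."""
    degree = [0] * n_nodes
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return Counter(degree)

# Hypothetical toy network: a pair, a triple and one isolated author.
edges = [(0, 1), (2, 3), (3, 4), (2, 4)]
sizes = cluster_size_counts(6, edges)
degrees = degree_counts(6, edges)
```

In the toy network every cluster size from 1 to 3 occurs exactly once, while the most common number of collaborators is 2. The example only illustrates that the two distributions need not peak at the same value.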

3 Computer simulation 

We now describe a model reproducing these results, hoping to understand how this
macroscopic behaviour arises from microscopic decisions of the individuals.

A model describing small-world networks quite well is that of Barabasi and Albert
[12]. Unfortunately, it only deals with networks consisting of one single component. We
generalize it to discontiguous networks.

3.1 Standard Barabasi-Albert model 

The Barabasi-Albert network model starts with a set of (usually) m_0 = 3 points, each
connected to every other by a link.

In each time step, a new node is added with (usually) m = 3 links to the
existing network. The probability of an already existing vertex being a linking target is
proportional to the number of connections already present at this node: "The rich get
richer."
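
As a small worked example of this attachment rule, the following sketch turns a list of node degrees into linking probabilities. The helper name `attachment_probabilities` is hypothetical, and the degree list assumes a fully connected seed of three nodes plus one newcomer that, for brevity, attached with a single link to node 0.

```python
def attachment_probabilities(degrees):
    """Probability of each existing node being chosen as a linking target."""
    total = sum(degrees)
    return [k / total for k in degrees]

# Seed triangle (degree 2 each) plus one newcomer linked to node 0.
degrees = [3, 2, 2, 1]
probs = attachment_probabilities(degrees)
```

Node 0, which already holds three links, is three times as likely to be chosen as the newcomer with one link.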

3.2 Modified model 

To cope with networks consisting of several components, we have to modify the model.
We choose a very simple approach: in each step of adding nodes, we start a new network
of m_0 = 3 nodes with a certain probability p.

Vertices added in subsequent time steps can connect to any node in any component,
following the same probability rule as in the standard model.

In the case m = 1, components can only grow (isolated clusters), whereas in the case
m > 1, new nodes are able to connect two or more existing components of the network
(merging clusters).
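
A minimal sketch of the modified growth rule is given below. It is not the original simulation code, and several details are assumptions: a new component replaces the ordinary node of the same time step, a new component is only started if it still fits below the target size, the m targets of a new node are distinct, and the name `simulate_cluster_sizes` and its parameters are hypothetical.

```python
import random
from collections import Counter

def simulate_cluster_sizes(n_total=555, m=1, m0=3, p=0.02, seed=None):
    """Grow one network and return the sorted list of its cluster sizes."""
    if not (1 <= m <= m0 <= n_total):
        raise ValueError("need 1 <= m <= m0 <= n_total")
    rng = random.Random(seed)
    parent = []  # union-find forest over node indices
    stubs = []   # node i is listed deg(i) times: a uniform draw from this
                 # list selects a node with probability proportional to degree

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def link(a, b):
        stubs.extend((a, b))
        parent[find(a)] = find(b)

    def new_component():
        first = len(parent)
        parent.extend(range(first, first + m0))
        for a in range(first, first + m0):
            for b in range(a + 1, first + m0):
                link(a, b)  # fully connected embryonic network

    new_component()
    while len(parent) < n_total:
        # Assumption: a new component replaces the ordinary node of this step.
        if rng.random() < p and len(parent) + m0 <= n_total:
            new_component()
            continue
        targets = set()
        while len(targets) < m:  # m distinct, degree-weighted targets
            targets.add(rng.choice(stubs))
        node = len(parent)
        parent.append(node)
        for t in targets:
            link(node, t)
    return sorted(Counter(find(v) for v in range(n_total)).values())
```

Under these assumptions, `simulate_cluster_sizes(p=0.0)` returns a single cluster of 555 nodes, and `simulate_cluster_sizes(p=1.0)` returns 185 clusters of size 3. Values of p in between interpolate between the two extremes.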

3.3 Results 

To compare the results with the data collected in section 2.2, we let the network grow
up to the same size of 555 nodes. This is repeated 10 times for statistical reasons.

3.3.1 Isolated clusters 

In the case m = 1, i.e. the case of isolated clusters, we can be sure to get scale-free
behavior within the distinct clusters, as the probabilities for attachment of a new node
to an existing one are the same as in a single Barabasi-Albert network (up to a
proportionality factor due to a new node having a "choice" between different clusters
to connect to).

However, the complete network does not necessarily have to be scale-free, as the total
statistics is a sum over multiple scale-free sub-networks or clusters.

Figure 2: Frequency of clusters vs. cluster size at different probabilities for a new net.
Simulation was run 10^4 times with a network growing up to 555 nodes. The
curve for p = 0.01 is the one with the rightmost peak; to the left follow the
other p-values in ascending order.

First, we examine the number of clusters of different sizes (figure 2).

Obviously, a high probability of starting a new net leads to many smaller networks,
whereas a low one favors bigger networks. Yet we make an interesting observation:
low probabilities lead to a cluster-size distribution that is no longer monotonic, but
favors big networks.

The explanation is straightforward: for p = 0 we will see a graph ∝ δ(555), as there is
only one giant cluster; for p = 1 a graph ∝ δ(m_0 = 3), because there are only embryonic
sub-nets. What we observe for 0 < p < 1 is the transition between both extremes.

For all p we start with a power-law region of the distribution for small and
medium cluster sizes. The exponent varies with the network-birth probability p. In
figure 3 we analyze this correlation in a semi-logarithmic plot.

Figure 3: Negative exponent of the power-law part of the curves in figure 2 vs. probability
p for a new net. The line corresponds to e^{2.25p}.

We find that -e^{2.25p} describes our data rather well. Of course, this formula cannot
hold for general p, as for p → 1 we expect the exponent to tend to -∞.
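
How such an exponent can be read off a cluster-size distribution is sketched below. The fitting procedure is an assumption, since the text does not state how the exponents were determined: the sketch uses an ordinary least-squares line through the points in a log-log plot. The names `power_law_exponent` and `reference_line` are hypothetical, and the check uses synthetic values rather than data from the simulation.

```python
import math

def power_law_exponent(sizes, freqs):
    """Least-squares slope of log(freq) vs. log(size)."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

def reference_line(p):
    """Negative exponent suggested by the empirical formula e^{2.25p}."""
    return math.exp(2.25 * p)

# Synthetic check (not data from the paper): f(s) = 100 * s^-2.
sizes = [3, 4, 5, 6, 8, 10]
freqs = [100 * s ** -2 for s in sizes]
slope = power_law_exponent(sizes, freqs)
```

In practice the fit would be restricted to the small and medium cluster sizes, where the power-law region is observed.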

Finally, we compare the cluster distribution from the computer simulation with real-world
data (figure 4) and are surprised: the data fit well, including the giant cluster we
took for an anomaly whilst building the network. The plot gives strong evidence
that it was indeed an organic part of the network. Its size is simply due to the
graph-forming rules.

3.3.2 Merging clusters 

Now we modify the model by examining m > 1. In this case, newly added vertices
develop several links to existing nodes (and thus existing clusters) and are able to connect
hitherto separated networks. In this paper, we limit our considerations to the standard
Barabasi-Albert case m = m_0 = 3.

Using different p, we quickly recognize that low and medium probabilities make the
simulation nearly always end up with a single giant cluster containing all vertices. The
points of interest are higher p in the region of 60-90%.

Figure 4: Comparison of the simulation with p = 0.02, using the isolated clusters model of
figure 2, and statistical data from a science collaboration network. To simplify
comparison, relative frequencies are used.



What about scale-free behavior in this case? In figure 5 we can see that there is no
pure scale-free behavior. There seems to be power-law behavior for small degrees and
an exponential cutoff at higher values, as has been verified by a semi-logarithmic plot of
figure 5. Similar results have been observed by Newman for collaboration networks.

One could argue that this effect is due to the fact that we do not plot the degree
distribution for single clusters but for the whole set of them. This objection only holds
at first sight, though. At p = 80% we have several small clusters but virtually only
one giant cluster, which dominates the degree distribution for high degrees. So the
averaging over many clusters of different sizes should manifest itself mainly in the area
of small degrees, contrary to our observations.

For small cluster sizes, we observe a non-uniform behavior of the frequency of
clusters of a given size. There is no monotonic trend of the sort that larger clusters are
less probable than small ones.

The explanation is as follows: newly born clusters have a size of m_0 = 3 and thus
appear very often. Also, clusters of sizes 4 or 7 are very probable, whereas a cluster of
size 5 is very rare, because it can only be formed by a new cluster to which two new
nodes have connected without gluing it to a second cluster.

Figure 5: Frequency of nodes with a certain degree. Simulation was run 10^4 times with
a network growing up to 555 nodes.

In a semi-logarithmic plot, we find a parabolic dependence for high cluster sizes (i.e.
a Gaussian distribution around a mean depending on p).
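
The step from a parabola in the semi-logarithmic plot to a Gaussian can be made explicit. In the following short derivation, f(s) denotes the frequency of clusters of size s, and the symbols \mu (mean, depending on p) and \sigma (width) are introduced here for illustration only:

```latex
f(s) \propto \exp\!\left(-\frac{(s-\mu)^2}{2\sigma^2}\right)
\quad\Longrightarrow\quad
\ln f(s) = \mathrm{const} - \frac{(s-\mu)^2}{2\sigma^2}
```

The logarithm of a Gaussian is thus a downward-opening parabola in s with its vertex at the mean, which is the shape observed for high cluster sizes.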

4 Conclusion 

We constructed a network of co-authorship with 555 authors. Only scientists who cite
a specific paper [12] were chosen. We find a cluster size distribution showing an exponential
decay for small cluster sizes and a giant cluster that cannot be explained by common
network models.

A change to the model of Barabasi and Albert [12], allowing a certain probability
of starting new clusters, enables us to simulate networks consisting of distinct clusters, e.g.
friendship or collaboration networks.

The modification provides two different models. In the isolated clusters variant, newly
born clusters stay distinct forever and show scale-free behavior on their own.
This model is able to explain facts formerly regarded as statistical anomalies, such as the
observation of a giant cluster whose size greatly exceeds that of all others in the network.
According to the simulation, this observation fits very well (figure 4).

Merging clusters is the second variant, i.e. new nodes are able to merge existing
clusters. This model shows an exponential fall-off for higher degrees in the degree
distribution and thus no pure scale-free behavior. Furthermore, it results mainly in a
Gaussian distribution of cluster sizes for bigger clusters and thus cannot reproduce reality.

More results will be given in [21].